Abstract

Managing communication through on-line routing in new FPGAs
with a large amount of logic resources and partial
reconfigurability is a new and challenging problem. A
Network-on-Chip (NoC) typically uses a packet routing
mechanism, which often suffers from unsafe data transfers and
network interface overhead. In this paper, circuit routing for
such dynamic NoCs is investigated, and a practical
1-dimensional network with an efficient routing algorithm is
proposed and implemented. This concept is also extended to the
2-dimensional case. The implementation results show the low
area overhead and high performance of this network.


1. Introduction 

The amount of logic resources in FPGAs is growing
continuously, and their dynamic configuration abilities lead
to multitasking systems, which need resource and communication
management. Different on-line placement algorithms, the
central part of resource management, have been proposed and
developed [?]. However, communication management is still
challenging. In [?], the routing cost is considered during
placement; however, this approach provides neither a routing
algorithm nor a structure to solve the problem. The issue of
communication management has been addressed by
Networks-on-Chip (NoCs), an emerging research topic.

Networks-on-Chip have been shown to be a good solution to
support communication on a System-on-Chip. Using an on-chip
interconnection network to replace top-level global routing
has the advantages of structure, performance, and modularity.
A chip employing a NoC is composed of a set of network clients
such as DSPs, memories, peripheral controllers, and custom
logic. Most existing work uses packet routing for
communication between modules [?]. However, packet routing has
two main disadvantages. First, each module has an area
overhead, because it needs a network interface to split data
into packets at the source and merge them at the destination.
Second, it reduces performance: if one of the packets is lost,
the destination cannot use the transferred data, and the lost
packet must be sent again. The other possibility is circuit
routing, which establishes a physical connection between
source and destination by setting the required switches.
Compared to circuit routing, packet-based approaches use
routing resources more efficiently by sharing them among
different connections, but they incur a network interface
overhead and lower data transfer performance.

In this paper, circuit routing for NoCs is investigated, and a
local and efficient circuit routing algorithm is presented for
a topology modified from a multi-processor network, the
Reconfigurable Multiple Bus (RMB) [7]. This topology was
originally developed as a ring topology with packet switching
for multicomputer systems; we adapted it to 1-dimensional NoCs
with circuit switching. We also extended the network structure
and the routing algorithm to the 2-dimensional case.

The rest of the paper is organized as follows. In Section 2,
we briefly review existing on-chip network infrastructures.
Section 3 deals with the Reconfigurable Multiple Bus on Chip
(RMBoC) structure and our routing approach. The extension of
RMBoC to 2-dimensional networks is presented in Section 4.
Section 5 contains the details of our implementation, and the
restrictions and challenges of using the infrastructure with
circuit routing under partial reconfiguration are presented in
Section 6. The results are given in Section 7, and Section 8
concludes the work and suggests future work.

2. Interconnection Network Architectures 

The choice of the system-level communication architecture has
a significant impact on system performance and energy
consumption. We give a short overview of some proposed
interconnection structures for on-chip networks.

The best-known infrastructure is the bus architecture. The
Advanced Microcontroller Bus Architecture (AMBA) from ARM,
CoreConnect from IBM, and WISHBONE from Silicore are some
existing bus-based communication architectures for SoCs.
Traditionally, they have been used for data path
interconnection because of their simplicity. However, only one
module can drive the network at a time. Moreover, a bus
arbiter is needed when several processors attempt to use the
bus simultaneously. As a result, all connections must be
determined by the arbiter, so this routing approach has low
performance and is not scalable [?].

A mesh-based interconnection network has been suggested for
System-on-Chip in [?], where an array of routers interconnects
an array of processors. The router network has a 2-dimensional
torus topology to limit hardware overhead. It has been
implemented in a 1-dimensional structure, and wormhole routing
(a form of packet routing) is adopted.

DeHon et al. [6] have proposed a Fat-Tree topology for an
on-chip interconnection network. There is a unique set of
switches between any source and sink in this network. Finding
the routes still requires global approaches, and the
PathFinder algorithm has been used [?].

Another topology that has been used for NoCs is a hexagonal
mesh, or Honeycomb [?]. Each resource is directly connected to
three switches and can reach 12 resources with a single hop.
The main advantages of this topology are that fewer hops are
needed to connect resources and that the ratio of resources to
switches is three.

With the exception of the Fat-Tree structure, all of the above
architectures have been applied to packet switching. The
Fat-Tree topology needs a global routing algorithm that
establishes connections by finding the shortest path and then
reducing congestion on shared segments.

To achieve the highest speed, we want to develop circuit
routing for on-chip interconnections that has the following
features and advantages:

• The infrastructure, including switches and their
connections, occupies a small area (low area overhead).

• The routing connections can be determined quickly and
locally at the switches.


To achieve these features, we have chosen the concept of the
Reconfigurable Multiple Bus (RMB) network [7], which was
proposed for multi-processor networks. We have modified the
RMB for use as a Network-on-Chip. This interconnection
structure is called RMBoC and is explained in the next
section.

3. RMBoC Structure 

The reconfigurable multiple bus architecture relies on an
array of parallel bus segments between processing nodes. Each
processing node can access the reconfigurable bus system to
communicate with another processing node. The bus controller
connected to each node coordinates the efficient use of
available buses through reconfiguration. The most important
aspect of this architecture is that the reconfiguration takes
place entirely independently of any current communication in
which the bus segments are involved [7]. RMB was proposed as a
ring-based topology for implementing a medium-size
multi-processor system. The processors send messages through
RMB using a mechanism based on wormhole routing. New
communication channels are allocated at the top segments.

For example, in Figure 1, a connection between modules 2 and 5
is first established using the highest free segments, and then
module 4 is routed to module 1 through the highest remaining
free ones. During the lifetime of a communication, the
allocated channel is moved down to other free channels. This
process is called bus compaction and is used to reduce the
establishment time of a connection.


Figure 1: Routing Strategy on RMB.


We have changed three main aspects of this approach: 

• Instead of a ring-based topology, we use a 1-dimensional
array. Implementing a ring on an FPGA would require either
global routing lines through the network, which prevent
dynamic reconfiguration, or slow wrap-around connections
outside of the FPGA.

• We do not perform compaction, i.e., moving all occupied
segments to the bottom free segments, because compaction
would cause signal assignment conflicts.

• We do not use any packet switching protocol; all routes
are circuit-switched and controlled by signaling.

An RMBoC with n processing elements and k buses is depicted in
Figure 2. To establish a connection between any two
processors, the highest free bus segments are selected
dynamically. As shown in Figure 2, four types of switches are
used. The basic structure of these switches, as depicted in
Figure 2, is very simple and uses very few transistors.
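The selection of the highest free bus segments can be illustrated with a minimal sketch. It is not the hardware implementation: the function name `allocate_path`, the 0-based module indices, the convention that bus 0 is the topmost bus, and the assumption that a switch can join any incoming bus to any outgoing bus are all hypothetical choices made for illustration. The sketch also commits all hops at once, whereas the crosspoints allocate segments one by one.

```python
from typing import List, Optional

def allocate_path(free: List[List[bool]], src: int, dst: int) -> Optional[List[int]]:
    """Pick the highest free bus on every hop between src and dst.

    free[h][b] is True when bus b is free on hop h, i.e. on the segment
    between module h and module h + 1. Bus 0 is assumed to be the top bus.
    Returns the chosen bus per hop, or None if some hop has no free bus.
    """
    lo, hi = min(src, dst), max(src, dst)
    chosen: List[int] = []
    for hop in range(lo, hi):
        bus = next((b for b, is_free in enumerate(free[hop]) if is_free), None)
        if bus is None:
            return None  # route fails; nothing has been marked as occupied
        chosen.append(bus)
    # Commit only after every hop has found a free segment.
    for hop, bus in zip(range(lo, hi), chosen):
        free[hop][bus] = False
    return chosen

# Five modules (indices 0..4), two buses, as in the two-connection example.
free = [[True, True] for _ in range(4)]
first = allocate_path(free, 1, 4)   # modules 2 -> 5 in 1-based numbering
second = allocate_path(free, 3, 0)  # modules 4 -> 1 in 1-based numbering
third = allocate_path(free, 2, 3)   # both buses on this hop are taken
```

With two buses, the second connection has to drop to the lower bus wherever it overlaps the first one, and a third connection over the shared hop is rejected.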



Figure 2: (a) RMBoC Architecture (b) Basic structure of 
switches in RMBoC. 


This network architecture is appropriate for Xilinx FPGAs that
have a column-based configuration architecture, where it can
be used as a 1-dimensional network for run-time
reconfiguration. Details of this implementation are presented
in Section 5.

4. Extension of RMBoC 

We have also extended the RMBoC to 2-dimensional networks. The
main reasons for this extension are:

• To increase the utilization of FPGA resources, we need a
network architecture similar to popular (mesh-based) FPGA
architectures.

• To realize a fully-connected 1-dimensional network with n
processors, $O(n^2)$ parallel buses are needed. For a
fully-connected 2-dimensional network with $n = N \times N$
processors, $N \times O(N^2) = O(N^3) = O(n^{1.5})$ buses are
required. Thus, bus segments can be used more efficiently.
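The bus-count comparison in the second reason can be written out step by step. The block below only restates the asymptotic argument given above, under the assumption that each of the N rows (or columns) is itself a fully-connected 1-dimensional network of N processors; the subscripted names are labels introduced here for readability.

```latex
\begin{align*}
n &= N \times N \quad\Rightarrow\quad N = n^{1/2},\\
\text{buses}_{\text{1-D}} &= O(n^{2}) = O(N^{4}),\\
\text{buses}_{\text{2-D}} &= N \times O(N^{2}) = O(N^{3})
   = O\!\left(\left(n^{1/2}\right)^{3}\right) = O(n^{1.5}).
\end{align*}
```

The 2-dimensional arrangement therefore saves a factor of order $N = \sqrt{n}$ in buses compared to a single fully-connected 1-dimensional array.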


A 2-dimensional RMBoC with N×N processors and k buses in each
row and column is shown in Figure 3. When establishing a
route, connections tend to go upward, i.e., upward is the
first choice in each switch, according to the destination
location. If the destination is located at a lower level, the
rightward and leftward channels are used, depending on the
sink. Only when a route reaches the same column as the
destination and the destination is at a lower level is the
downward channel selected. For example, Figure 3 shows the
routings from A to B and from C to D.
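The direction preference described above can be sketched as a small decision function. This is one possible reading of the rule, not the implemented switch logic: the names `next_direction` and `trace`, the (row, column) coordinates with larger rows lying higher, and the omission of free-channel checks are assumptions made for illustration.

```python
from typing import List, Tuple

Pos = Tuple[int, int]  # (row, col); larger row is assumed to be higher up

def next_direction(cur: Pos, dst: Pos) -> str:
    """Return the channel a switch at `cur` prefers for a route to `dst`."""
    (row, col), (dst_row, dst_col) = cur, dst
    if cur == dst:
        return "arrived"
    if dst_row > row:
        return "up"      # upward is the first choice
    if dst_col > col:
        return "right"   # destination is level or lower: move sideways first
    if dst_col < col:
        return "left"
    return "down"        # same column and destination at a lower level

def trace(src: Pos, dst: Pos) -> List[Pos]:
    """Follow next_direction from src to dst, ignoring channel occupancy."""
    step = {"up": (1, 0), "down": (-1, 0), "right": (0, 1), "left": (0, -1)}
    path, cur = [src], src
    while cur != dst:
        d_row, d_col = step[next_direction(cur, dst)]
        cur = (cur[0] + d_row, cur[1] + d_col)
        path.append(cur)
    return path

up_then_right = trace((0, 0), (2, 3))
left_then_down = trace((3, 3), (1, 0))
```

A route to a higher destination climbs first and then moves sideways, while a route to a lower destination moves sideways first and descends only in the destination column.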



Figure 3: 2-Dimensional RMBoC. 


5. Implementation 

In this section, we present the details of our implementation
on Xilinx FPGAs that have a column-based configuration
structure. We have focused on implementing the 1-dimensional
RMBoC on these devices. We have also implemented the
2-dimensional model to analyze the characteristics and
resource requirements of the network.

In this system, the actual crosspoints in one column are
merged into one controller, which is different from the
conceptual structure mentioned before (see Figure 4). If the
separated crosspoint structure were used, these points would
have to communicate with each other to find a free channel,
which would take more clock cycles. If they are combined into
one block, however, the decision can be made within one clock
cycle. Furthermore, separate structures need more FIFOs for
storing the unprocessed requests, while the combined one
requires only one FIFO. Therefore, we refer to the combined
switches in one column as a crosspoint.

Figure 4: Implementation of the 1-D RMBoC.

In our example, the whole system consists of four modules;
each one is a so-called crosspoint. Inside a single
crosspoint, there are three kinds of structures: a)
controller, b) data network, and c) FIFOs, as shown in
Figure 5. In the following, the function of these structures
is explained:



Figure 5: Architecture of a crosspoint. 


Controller: The function of the controller is to transport
control commands from one processor to another and to
configure data channels between processors. In total, there
are four kinds of commands: REQUEST, REPLY, CANCEL, and
DESTROY. The processors may use these commands in the
following way:

First, a processor sends a REQUEST command with the
destination address to its corresponding crosspoint. The
crosspoint then decides in which direction the command should
be transferred and writes it to the output buffer. During this
period, no physical channel is created, because it is possible
that the REQUEST will not be confirmed by the destination
processor. This does not delay connection establishment,
because the data transfer from the source cannot start before
all required segment allocations have been acknowledged. When
the next crosspoint gets the REQUEST, it checks the
destination and then decides whether to transfer the command
to its own corresponding processor or to the next crosspoint.

When the destination is reached, the processor gets the
request from the corresponding crosspoint and decides whether
the channel can be created or not. If so, it sends the REPLY
command; if not, it sends the CANCEL command. When a
crosspoint gets a REPLY command, it searches for a free
channel. If such a channel is available, the configuration of
the physical data network is adjusted. If not, a DESTROY
command is automatically sent back to the destination to free
all previously created channels, and a CANCEL command is
automatically sent to the source to signal that the channel
cannot be created. When the REPLY command reaches the source
processor, the complete channel has been created.

When a crosspoint gets a CANCEL command, it simply transfers
it to the next crosspoint or processor. No modifications are
made to the configurations. When the source processor wants to
close the data channel after the data transfer, it sends the
DESTROY command to the crosspoint. When the crosspoint gets
the DESTROY command, it looks up its configuration registers,
destroys the corresponding channel, and transfers the command
to the processor or the next crosspoint.

With the above protocol, each processor may hold several
channels at the same time, as long as free channels are
available. Channels with reversed source and destination are
considered to be different channels. The functions of the
controller commands are summarized as follows:

- REQUEST: To establish a connection.

- REPLY: Acceptance of the connection request by the
destination.

- CANCEL: Rejection of the connection request by the
destination.

- DESTROY: Deallocation of the occupied channels of a
connection, either when there is no free channel for
establishing the complete path or when the source closes the
channel after the data transfer.
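The command handling inside one crosspoint can be sketched as follows. This is a simplified model under stated assumptions, not the implemented controller: the class name `Crosspoint`, the direction labels `to_source` and `to_destination`, the representation of a connection as a (source, destination) pair, and the choice of the lowest-numbered free channel are all hypothetical.

```python
from typing import List, Optional, Tuple

Conn = Tuple[int, int]  # (source, destination) processor ids

class Crosspoint:
    def __init__(self, k: int) -> None:
        # owner[c] holds the connection using channel c, or None if free.
        self.owner: List[Optional[Conn]] = [None] * k

    def handle(self, cmd: str, conn: Conn) -> List[Tuple[str, str]]:
        """Process one command; return the (command, direction) pairs to emit."""
        if cmd == "REQUEST":
            # Forward only; no physical channel is created yet.
            return [("REQUEST", "to_destination")]
        if cmd == "REPLY":
            if None in self.owner:
                self.owner[self.owner.index(None)] = conn
                return [("REPLY", "to_source")]
            # No free channel: release what was built, inform the source.
            return [("DESTROY", "to_destination"), ("CANCEL", "to_source")]
        if cmd == "CANCEL":
            return [("CANCEL", "to_source")]  # passed on, nothing changes
        if cmd == "DESTROY":
            if conn in self.owner:
                self.owner[self.owner.index(conn)] = None
            return [("DESTROY", "to_destination")]
        raise ValueError(f"unknown command: {cmd}")

cp = Crosspoint(k=1)
out_request = cp.handle("REQUEST", (1, 3))
out_reply = cp.handle("REPLY", (1, 3))     # takes the only channel
out_blocked = cp.handle("REPLY", (2, 4))   # no channel left
out_destroy = cp.handle("DESTROY", (1, 3))
```

In this model, a channel is bound only when the REPLY passes through, which mirrors the statement that a REQUEST creates no physical channel.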

Data network: The function of the data network is simply to
connect the corresponding data channels according to the
configurations modified by the controller. Once the connection
is established, data is transferred from source to destination
within one clock cycle.

FIFOs: The purpose of the FIFOs is to buffer commands. The
FIFO selector sends the commands from the FIFOs of each side
to the main FIFO. The arbitration policy in the FIFO selector
is round-robin (in the order Left, Right, and PE).
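The round-robin arbitration of the FIFO selector can be sketched in software. The class name `FifoSelector`, the use of unbounded `deque` objects, and the behavior of skipping an empty side within the same step are assumptions for illustration; the hardware FIFOs have a fixed depth.

```python
from collections import deque
from typing import Deque, Dict, Optional

class FifoSelector:
    ORDER = ("left", "right", "pe")  # round-robin order: Left, Right, PE

    def __init__(self) -> None:
        self.inputs: Dict[str, Deque[str]] = {s: deque() for s in self.ORDER}
        self.main: Deque[str] = deque()
        self._next = 0  # index in ORDER of the side to be served first

    def step(self) -> Optional[str]:
        """Move at most one command to the main FIFO; return the side served."""
        for offset in range(len(self.ORDER)):
            idx = (self._next + offset) % len(self.ORDER)
            side = self.ORDER[idx]
            if self.inputs[side]:
                self.main.append(self.inputs[side].popleft())
                self._next = (idx + 1) % len(self.ORDER)
                return side
        return None  # all input FIFOs are empty

sel = FifoSelector()
sel.inputs["left"].extend(["L1", "L2"])
sel.inputs["right"].append("R1")
sel.inputs["pe"].append("P1")
served = [sel.step() for _ in range(5)]
```

Each side is served in turn, so a burst of commands from one direction cannot starve the other two.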

The main FIFO and the FIFO selector are used for the following
reason. Without them, three function blocks would be needed
for processing commands from the left, right, and bottom
FIFOs. After processing the commands, some glue logic
connected to the three blocks would have to decide which block
may write to the output (3 to 1). However, the three function
blocks are similar, so they can be simplified to one block
that processes all commands from left, right, and bottom. A
FIFO selector is therefore needed to collect the commands from
the different directions and store them in one extra FIFO.
Thus, the area of the three function blocks is reduced to 1/3.

A single crosspoint consists of the above three structures and
is connected to a processor. We have made measurements for a
system consisting of four modules, i.e., four crosspoints and
four processors. They are placed in parallel and connected by
so-called bus-macros to enable partial reconfigurability.

In the top-level structure of a 2-dimensional RMBoC, there is
a total of 16 processors, each connected to two crosspoints,
one for row transfers and the other for column transfers.
Consequently, 32 crosspoints are used, 16 for row connections
and 16 for column connections. The main difference between
crosspoints in one and two dimensions is the address width. As
to the behavior of the processors, they now have to decide
which crosspoint should be used, the one for the row
connection or the one for the column connection. Another task
of the processor is to transfer commands to other processors
by switching from a row connection to a column connection, and
vice versa.

Figure 6: Processing of command $C_j$ in $cp_j$.

To summarize the communication protocol, the main steps by
which crosspoint $cp_j$ processes a command
$C_j \in \{\text{REQUEST}, \text{REPLY}, \text{CANCEL}, \text{DESTROY}\}$,
performs the required operation(s), and generates the command
$C_{j+1}$ (Figure 6) are as follows:

1. Read the command $C_j$ from the left side FIFO.

2. Write the command $C_j$ to the main FIFO.

3. The controller reads the command $C_j$.

4. Determine the new command $C_{j+1}$ and update the
configuration of channels if needed.

5. Write the command $C_{j+1}$ to the right side FIFO.

Steps 1 to 3 each take two cycles; steps 4 and 5 execute in
parallel and together need two cycles. In the best case, the
FIFOs are empty and a command is processed in 8 cycles (for
the 5 steps).

To analyze the maximum delay for processing a command, a
definition is required:

Definition 1 MaxTotalComm is the maximum number of 
commands in all three directions (Left, Right, Processing 
Element). 

Theorem 1 The maximum required processing time is
$(\mathit{MaxTotalComm} - 1) \times 4 + 4$ cycles.

Proof: The main steps are executed in a pipeline, in which
steps 1 and 2 form the first stage and the remaining steps the
second stage. Therefore, the waiting time is four cycles for
each command in the FIFOs, plus four cycles for the second
stage of the pipeline.
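The cycle counts above can be reproduced with a few lines of arithmetic. The sketch simply transcribes the stated step durations and the expression of Theorem 1; the constant and function names are hypothetical, and the value 10 used in the example is the Theorem 2 bound for n = 4.

```python
# Cycles for step 1, step 2, step 3, and steps 4 + 5 (which run in parallel).
STEP_CYCLES = (2, 2, 2, 2)
STAGE_CYCLES = 4  # each of the two pipeline stages takes four cycles

def best_case_cycles() -> int:
    """Latency of one command through a crosspoint with empty FIFOs."""
    return sum(STEP_CYCLES)

def worst_case_cycles(max_total_comm: int) -> int:
    """Bound of Theorem 1: (MaxTotalComm - 1) * 4 + 4 cycles."""
    if max_total_comm < 1:
        raise ValueError("at least one command must be present")
    return (max_total_comm - 1) * STAGE_CYCLES + STAGE_CYCLES

best = best_case_cycles()
worst_n4 = worst_case_cycles(10)  # MaxTotalComm = 10 for n = 4
```

For four modules, the bound evaluates to 40 cycles for the command that waits longest.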

Theorem 2 MaxTotalComm is equal to
$\lceil \frac{n^2}{2} + \frac{2n - 4}{2} \rceil$.

Proof: To compute MaxTotalComm, the maximum number of commands
is computed separately for each direction. For example, in
crosspoint $cp_j$ (Figure 6), the maximum number of requests
from the right side occurs when all modules to the right of
$cp_j$ send commands to all processing elements to the left of
$cp_j$, including $PE_j$: $(n - j) \times j$. The maxima for
the other two directions are computed similarly. From the left
side, at most $(j - 1) \times (n - j + 1)$ commands can be
received, and from the PE, at most $(n - 1)$. As can be seen,
the maximum numbers of requests from the right and left sides
depend on the position of crosspoint $cp_j$. It can easily be
proved that these two numbers are maximized when $j = n/2$,
and then
$\mathit{MaxTotalComm} = \lceil \frac{n^2}{2} + \frac{2n - 4}{2} \rceil$.
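The bound of Theorem 2 can be checked numerically. For even n and j = n/2, the three per-direction terms of the proof sum to n^2/4 + (n^2/4 - 1) + (n - 1) = n^2/2 + (2n - 4)/2. The sketch below compares a brute-force maximum over all crosspoint positions with that closed form, rounded up; the function names are hypothetical, and the closed form is the reading of the theorem used in this text.

```python
def commands_at(j: int, n: int) -> int:
    """Worst-case number of commands arriving at crosspoint j (1-based)."""
    from_right = (n - j) * j            # modules right of cp_j to PEs 1..j
    from_left = (j - 1) * (n - j + 1)   # modules left of cp_j to PEs j..n
    from_pe = n - 1                     # PE_j to every other module
    return from_right + from_left + from_pe

def max_total_comm(n: int) -> int:
    """Brute-force maximum over all crosspoint positions."""
    return max(commands_at(j, n) for j in range(1, n + 1))

def closed_form(n: int) -> int:
    """ceil(n^2 / 2 + (2n - 4) / 2), computed with integer arithmetic."""
    return (n * n + 2 * n - 4 + 1) // 2

agree = all(max_total_comm(n) == closed_form(n) for n in range(1, 65))
n4 = max_total_comm(4)
```

For odd n the maximum is reached at j = (n + 1)/2, which is where the rounding in the closed form comes from.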

It should be noted that this maximum delay for a command can
occur only once along the network: when a command reaches its
next crosspoint and the commands ahead of it have already been
processed, no waiting cycles arise. Moreover, it has to be
guaranteed that the number of commands in each direction does
not exceed the depth of the FIFOs; if the FIFO depth is
smaller than the maximum number of simultaneous commands in a
direction, some commands can be lost, and the source has to
send them again.

6. Dynamic Reconfiguration Challenges 

To enable partial reconfiguration in the RMBoC structure, hard
bus-macros are used to solve the communication problem of
crosspoints during reconfiguration. The bus-macro ensures the
reproducibility of the design routing and is implemented using
tri-state buffers. The tri-state buffers force the routing to
always pass through the same places. At the same time, they
decouple the modules from each other during reconfiguration,
avoiding possibly harmful transitory situations. In this way,
a data bandwidth of 4 bits per row communication channel is
possible between adjacent modules. This limitation comes from
the current Virtex architecture and its limited routing
resources.

Some problems still have to be considered. For example, in
Figure 4, assume there are connections between PE1 and PE3,
and also between PE1 and PE4. If PE2 now has to be
reconfigured, what should happen to the configuration of CP2?

Virtex-II (Pro) devices offer glitchless partial
reconfiguration. If a configuration bit holds the same value
before and after configuration, there will be no glitch on the
resource that bit controls. Resources requiring special
attention are SRL16s and LUT RAMs, because they change
dynamically and will be overwritten when configuration occurs
[4]. Therefore, the data on the segments will be sent without
any glitch. The only remaining problem concerns the state
values of the crosspoint. Since these values would be lost
during reconfiguration if they were saved in LUTs, we use
Block-RAMs, which are distributed over six regions of the FPGA
area. This means that we cannot use more than six modules in
the RMBoC, and four modules for the appropriate structure.

The other problem that can arise with dynamic reconfiguration
is the loss of non-completed connection requests. For example,
in Figure 4, PE4 requests a connection with PE1, and this
request occupies a free segment in CP3 for this connection.
Before a free segment is allocated in CP2, and exactly when
the request is read in CP2, PE2 is reconfigured. The
information about this request is then lost, and the whole
request times out, because the source does not receive any
acknowledgment from the destination. The source sends the
request again, but a free segment in CP3 remains occupied
uselessly because of the non-completed request. To solve this
problem, each crosspoint should allocate only one channel for
requests with the same source and destination.
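The rule of one channel per source and destination pair can be sketched as a small allocation table. The class name `SegmentTable` and its `reserve` method are hypothetical; the sketch only shows that a repeated request after a timeout reuses the segment it already holds instead of occupying a second one.

```python
from typing import List, Optional, Tuple

class SegmentTable:
    """Channel bookkeeping of one crosspoint, keyed by (source, destination)."""

    def __init__(self, k: int) -> None:
        self.owner: List[Optional[Tuple[int, int]]] = [None] * k

    def reserve(self, src: int, dst: int) -> Optional[int]:
        conn = (src, dst)
        if conn in self.owner:
            # A retried request finds its earlier segment and reuses it.
            return self.owner.index(conn)
        if None in self.owner:
            channel = self.owner.index(None)
            self.owner[channel] = conn
            return channel
        return None  # no free channel in this crosspoint

cp3 = SegmentTable(k=2)
first_try = cp3.reserve(4, 1)   # request from PE4 to PE1
retry = cp3.reserve(4, 1)       # same request, sent again after a timeout
reverse = cp3.reserve(1, 4)     # reversed pair counts as a different channel
full = cp3.reserve(2, 3)
```

Without the lookup in `reserve`, the retry would take the second channel and leave the first one occupied uselessly.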

Reconfiguration of modules may also cause the DESTROY command
to be lost, so that the destruction of the connection is not
completed. Therefore, an additional command, CONFIRM, is added
to acknowledge the completion of connection destruction. If
the source does not receive the CONFIRM from the destination,
it initiates a new DESTROY command.
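The DESTROY and CONFIRM handshake amounts to a retry loop at the source. The sketch below is an illustration under assumptions: `destroy_with_confirm` and the callback `send_destroy`, which returns True when a CONFIRM arrived before a timeout, are hypothetical names, and the bound `max_attempts` is added only so that the loop terminates.

```python
from typing import Callable

def destroy_with_confirm(send_destroy: Callable[[], bool],
                         max_attempts: int = 5) -> int:
    """Repeat DESTROY until CONFIRM arrives; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        if send_destroy():  # True: CONFIRM was received in time
            return attempt
    raise TimeoutError("no CONFIRM received for DESTROY")

# A link that loses the first two DESTROY commands, e.g. because a module
# on the path is being reconfigured, and delivers the third one.
outcomes = iter([False, False, True])
attempts = destroy_with_confirm(lambda: next(outcomes))
```

The connection is considered closed only once the acknowledgment has been seen, so a lost DESTROY can no longer leave channels allocated indefinitely.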

7. Analysis Results 

After the implementation, we evaluated the area overhead and
performance of the RMBoC. As shown in Table 1, the
1-dimensional RMBoC has been implemented on a Virtex-II 6000
with four processors (n = 4) and four parallel buses (k = 4),
with a data bandwidth of 16 bits (w = 16). The area overhead
grows and the maximum frequency decreases with increasing data
bandwidth. The area overhead is relatively low (from 4% to 15%
of the FPGA area), and the reachable frequency is about
120 MHz.

To analyze the behavior of our design when the number of
segments or the data bit width is increased, we compared
configurations in which the total maximal bandwidth of the
network ($k \times w$) is fixed at 32. As depicted in
Figure 7, when the number of segments k is decreased and the
segment bit width w is increased, the utilized area stays
nearly constant, but the performance of the design improves.
On the other hand, merging narrow segments into one wide
segment reduces the flexibility and the possibility of
establishing different connections simultaneously.

Figure 7: Tradeoff between the number of segments k and the
data bit width w, with a fixed number of modules n = 4 and a
fixed maximal bandwidth k × w.

As a case study to inspect probable communication defects, we
have implemented a video application with a VGA controller
running at 25 MHz for normal 640x480 VGA. A color generator
module (CG) communicates with the VGA controller (VC). The
color generator gets the X and Y coordinates of the current
pixel position from the VGA module, computes the color to be
placed at that position, and sends it back to the VGA module,
which displays the color at the corresponding position. The
color generator application is a convenient way to detect
changes in the communication, because they directly have a
visual effect on the screen. The X and Y positions are each
12 bits wide, and the color is 24 bits wide. This application
works well and without communication problems.
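The tradeoff between segment count and bit width discussed in this section can be made concrete by listing the configurations that share a total bandwidth of k × w = 32. The function name `configurations` and the restriction to power-of-two widths are assumptions; which of these points were actually measured is not stated here.

```python
from typing import List, Tuple

def configurations(total_bits: int = 32) -> List[Tuple[int, int]]:
    """Return (k, w) pairs with k * w == total_bits and w a power of two."""
    pairs = []
    w = 1
    while w <= total_bits:
        if total_bits % w == 0:
            pairs.append((total_bits // w, w))
        w *= 2
    return pairs

pairs = configurations()
# k is also the number of connections that can share one hop at a time.
most_flexible = max(pairs, key=lambda kw: kw[0])
widest = max(pairs, key=lambda kw: kw[1])
```

Moving along this list from many narrow segments to one wide segment keeps the total number of wires constant while reducing the number of connections that can coexist on a hop.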

We have also investigated the characteristics of the
2-dimensional RMBoC, and the results are presented in Table 2.
The area overhead seems to be too large (more than 50% of the
FPGA area) for practical use, but the maximum frequency is
still high (85-96 MHz). For 2-dimensional circuit routing in
future FPGAs, the on-line routing should be done in an
additional layer, so that the area overhead will not be a
bottleneck.

DataWidth w   Slices used   Slices used   4-input LUTs   4-input LUTs   Max frequency
(bit)         (#)           (%)           used (#)       used (%)       (MHz)
1             1367          4             2074           3              105
8             2100          6             3856           4              103
16            3407          10            6108           9              99
32            5084          15            9502           14             94

Table 1: Area overhead and performance of 1-dimensional RMBoC with n = 4 modules and k = 4 segments per module.

DataWidth w   Slices used   Slices used   4-input LUTs   4-input LUTs   Max frequency
(bit)         (#)           (%)           used (#)       used (%)       (MHz)
8             17192         50            32433          48             96
16            26762         61            37607          56             91
32            28156         83            53872          79             85

Table 2: Area overhead and performance of 2-dimensional RMBoC with n = 16 modules and k = 4 segments per module
and direction.

8. Conclusion 

In this paper, we have investigated on-line circuit routing,
in particular for dynamically reconfigurable devices. As a
practical solution for Xilinx FPGAs, we propose an RMBoC
network that has a low area overhead and works at high
frequencies. This solution has been implemented and is fully
functional on Xilinx FPGAs. In addition, we have extended the
RMBoC concept to two dimensions, at the expense of a
considerable amount of area. On the other hand, this
2-dimensional network yields high performance, which can be
useful for future generations of FPGAs.